Add more tool call integration tests and improve integration test infrastructure by atdrendel · Pull Request #142 · ml-explore/mlx-swift-lm

atdrendel · 2026-03-11T12:13:56Z

Proposed changes

I've noticed that tool calling doesn't always get tested when adding support for new models to mlx-swift-lm. Additionally, modifying the tool calling code is, at least to me, a bit stressful because we don't have robust tool calling integration tests proving code changes don't break tool calling for older models.

This pull request hopes to start improving our support for tool call testing going forward by:

- Adding tool call integration tests for Qwen3.5 and Nemotron
- Extending IntegrationTestModels to support the new models being tested
- Using IntegrationTestModels inside of ToolCallIntegrationTests so that models load lazily: a developer adding a new model only has to download that one model to run its integration tests
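
To illustrate the lazy-loading idea in the last item, here is a minimal sketch of a shared loader that caches one `Task` per model ID, so a model is only loaded the first time a test asks for it. `FakeContainer`, `LazyModelCache`, and the injected `load` closure are hypothetical names for illustration; the actual IntegrationTestModels API is not shown in this thread and may look different.

```swift
import Foundation

// Hypothetical stand-in for a loaded model container (not the real mlx-swift-lm type).
struct FakeContainer: Sendable {
    let modelID: String
}

// Sketch: each model ID is loaded at most once, and only when first requested.
actor LazyModelCache {
    private var tasks: [String: Task<FakeContainer, Error>] = [:]
    private(set) var loadCount = 0
    private let load: @Sendable (String) async throws -> FakeContainer

    init(load: @escaping @Sendable (String) async throws -> FakeContainer) {
        self.load = load
    }

    func container(for modelID: String) async throws -> FakeContainer {
        // Reuse an in-flight or finished load for this model.
        if let existing = tasks[modelID] {
            return try await existing.value
        }
        loadCount += 1
        let load = self.load
        let task = Task { try await load(modelID) }
        // Store the task before awaiting so concurrent callers share it.
        tasks[modelID] = task
        return try await task.value
    }
}
```

Because nothing is loaded until `container(for:)` is called, running the tests for a single model would only trigger that model's load. Note that this sketch also caches a failed load; a real implementation might prefer to retry.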

Note: The Nemotron integration tests are skipped by default because the model is huge. However, enabling the tests involves just swapping false for true in XCTSkipIf(). This seems like a sensible choice to me because of the size of the model, but I'd be curious what others think.
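
For readers unfamiliar with the pattern, a skip-by-default test could look roughly like the sketch below. The flag `runLargeModelTests`, the helper `skipUnlessLargeModelsEnabled`, and the test class are hypothetical names; the PR itself reportedly passes a literal Boolean to `XCTSkipIf()` instead of using a helper.

```swift
import XCTest

// Hypothetical switch (not a name from the PR); set to true to run the large-model tests.
let runLargeModelTests = false

// Throws XCTSkip when large-model tests are disabled, so the test is reported as skipped, not failed.
func skipUnlessLargeModelsEnabled(_ enabled: Bool = runLargeModelTests) throws {
    try XCTSkipIf(!enabled, "Large model tests are disabled by default")
}

final class LargeModelToolCallTests: XCTestCase {
    func testToolCall() async throws {
        try skipUnlessLargeModelsEnabled()
        // Model loading and tool-call assertions would go here.
    }
}
```

The test method has to be marked `throws` for the skip to propagate to the test runner.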

Here are a few of the recent pull requests made to address broken tool calling in some popular models:

- Many models: add missing context/toolcall parameters #140
- Qwen3.5: Fixed tool calling for qwen3.5 #133
- LFM2.5 VL: Fix LFM2.5 VL tools #139
- Mistral: Fix tool calling for Mistral 3 #132

Checklist

Put an x in the boxes that apply.

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Tests/MLXLMIntegrationTests/ToolCallIntegrationTests.swift

DePasqualeOrg · 2026-03-11T15:20:33Z

In #118, all of the integration tests were moved to a separate module that can be imported to run in the adapter packages. So either this gets merged first and the new/revised tests need to be re-implemented there, or that gets merged first and the tests need to be implemented in the new way.

atdrendel · 2026-03-11T15:24:38Z

I'm happy to have #118 get merged first, and then I'll re-add these tests in the correct spot.

Copilot

Pull request overview

Refactors MLXLM integration tests to use a shared async model-loading actor (instead of per-test-suite static state) and expands tool-call integration coverage to additional models.

Changes:

- Centralized model IDs and added cached async container loaders in IntegrationTestModels.
- Updated tool-call integration tests to lazily load model containers via IntegrationTestModels with skip-on-unavailable behavior.
- Added new Nemotron and Qwen3.5 tool-call auto-detection and end-to-end parsing tests.
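
The "skip-on-unavailable" behavior mentioned above might be approximated by turning a load failure into an `XCTSkip`, as in this sketch. `ModelUnavailable` and `loadOrSkip` are hypothetical names, and how the PR actually implements the behavior is not shown here.

```swift
import XCTest

// Hypothetical error for a model that is not present locally.
struct ModelUnavailable: Error {
    let modelID: String
}

// Runs `load`; if it fails, rethrows as XCTSkip so the test is reported as skipped rather than failed.
func loadOrSkip<T: Sendable>(
    _ modelID: String,
    load: @Sendable (String) async throws -> T
) async throws -> T {
    do {
        return try await load(modelID)
    } catch {
        throw XCTSkip("Model \(modelID) unavailable: \(error)")
    }
}
```

A test would call `try await loadOrSkip(...)` at the top of its body. One trade-off of this approach is that genuine loading bugs also surface as skips, so the skip message should carry the underlying error.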

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| Tests/MLXLMIntegrationTests/ToolCallIntegrationTests.swift | Reworked model loading to use shared async loaders; added Nemotron + Qwen3.5 tool-call tests. |
| Tests/MLXLMIntegrationTests/IntegrationTestModels.swift | Added additional model IDs and per-model cached loading methods (via stored `Task`s) for integration tests. |


atdrendel added 5 commits March 12, 2026 22:46

Add Nemotron tool integration test (a264943)
Use IntegrationTestModels inside of ToolCallIntegrationTests (96cde53)
Skip Nemotron tests in ToolCallIntegrationTests by default (d6bd700)
Add Qwen3.5 tool call integration tests (487cec6)
Disable Nemotron thinking because it uses way too many tokens to think (877cf57)

atdrendel force-pushed the pr-133-qwen3.5-tool-calling branch from f888b37 to 877cf57 March 12, 2026 21:48

atdrendel marked this pull request as ready for review March 12, 2026 21:51

Copilot AI review requested due to automatic review settings March 12, 2026 21:51

Copilot started reviewing on behalf of atdrendel March 12, 2026 21:52

Copilot AI reviewed Mar 12, 2026


Remove XCTSkipIf() (a0bdee5)

atdrendel wants to merge 6 commits into ml-explore:main from shareup:pr-133-qwen3.5-tool-calling
